Author: Stacey Beck
This is a comparison of households across Santa Clara County in California block areas using:
farmers market location
may want to consider using households with SNAP benefits if available from census
# Import packages for visualizing spatial data
!pip install contextily geopandas mapclassify
import geopandas as gpd
import pandas as pd
import seaborn as sns
import contextily as ctx
import matplotlib.pyplot as plt
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: contextily in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (1.2.0) Requirement already satisfied: geopandas in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (0.10.2) Collecting mapclassify Downloading mapclassify-2.4.3-py3-none-any.whl (38 kB) Requirement already satisfied: rasterio in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from contextily) (1.2.10) Requirement already satisfied: requests in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from contextily) (2.22.0) Requirement already satisfied: xyzservices in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from contextily) (2022.6.0) Requirement already satisfied: matplotlib in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from contextily) (3.3.2) Requirement already satisfied: pillow in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from contextily) (9.1.1) Requirement already satisfied: geopy in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from contextily) (2.2.0) Requirement already satisfied: mercantile in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from contextily) (1.2.1) Requirement already satisfied: joblib in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from contextily) (0.16.0) Requirement already satisfied: shapely>=1.6 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from geopandas) (1.8.2) Requirement already satisfied: fiona>=1.8 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from geopandas) (1.8.21) Requirement already satisfied: pandas>=0.25.0 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from geopandas) (1.3.5) Requirement already satisfied: pyproj>=2.2.0 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from geopandas) (3.2.1) Requirement already satisfied: scikit-learn in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from mapclassify) (0.23.2) Requirement already satisfied: numpy>=1.3 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from mapclassify) (1.21.6) Requirement already satisfied: networkx in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from mapclassify) (2.4) Requirement already satisfied: scipy>=1.0 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from mapclassify) (1.5.2) Requirement already satisfied: setuptools in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from fiona>=1.8->geopandas) (60.10.0) Requirement already satisfied: munch in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from fiona>=1.8->geopandas) (2.5.0) Requirement already satisfied: click-plugins>=1.0 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from fiona>=1.8->geopandas) (1.1.1) Requirement already satisfied: certifi in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from fiona>=1.8->geopandas) (2022.6.15) Requirement already satisfied: six>=1.7 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from fiona>=1.8->geopandas) (1.12.0) Requirement already satisfied: cligj>=0.5 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from fiona>=1.8->geopandas) (0.7.2) Requirement already satisfied: click>=4.0 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from fiona>=1.8->geopandas) (8.0.3) Requirement already satisfied: attrs>=17 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from fiona>=1.8->geopandas) (21.4.0) Requirement already satisfied: pytz>=2017.3 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas>=0.25.0->geopandas) (2019.2) Requirement already satisfied: python-dateutil>=2.7.3 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from pandas>=0.25.0->geopandas) (2.8.2) Requirement already satisfied: geographiclib<2,>=1.49 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from geopy->contextily) (1.52) Requirement already satisfied: cycler>=0.10 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from matplotlib->contextily) (0.10.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from matplotlib->contextily) (2.4.7) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from matplotlib->contextily) (1.2.0) Requirement already satisfied: decorator>=4.3.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from networkx->mapclassify) (4.4.0) Requirement already satisfied: affine in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from rasterio->contextily) (2.3.1) Requirement already satisfied: snuggs>=1.4.1 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from rasterio->contextily) (1.4.7) Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from requests->contextily) (3.0.4) Requirement already satisfied: idna<2.9,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from requests->contextily) (2.8) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from requests->contextily) (1.25.11) Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from scikit-learn->mapclassify) (2.1.0) Requirement already satisfied: importlib-metadata in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from click>=4.0->fiona>=1.8->geopandas) (4.8.1) Requirement already satisfied: zipp>=0.5 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from importlib-metadata->click>=4.0->fiona>=1.8->geopandas) (3.6.0) Requirement already satisfied: typing-extensions>=3.6.4 in /Users/staceyrivet/Library/Python/3.7/lib/python/site-packages (from importlib-metadata->click>=4.0->fiona>=1.8->geopandas) (4.1.1) Installing collected packages: mapclassify Successfully installed mapclassify-2.4.3
# Get the data for census income
!ls data
acs2020_5yr_B19001_14000US06085511302.geojson fips.png metadata.json
income_geo_df = gpd.read_file('data/acs2020_5yr_B19001_14000US06085511302.geojson')
income_geo_df.head()
| geoid | name | B19001001 | B19001001, Error | B19001002 | B19001002, Error | B19001003 | B19001003, Error | B19001004 | B19001004, Error | ... | B19001013, Error | B19001014 | B19001014, Error | B19001015 | B19001015, Error | B19001016 | B19001016, Error | B19001017 | B19001017, Error | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 05000US06085 | Santa Clara County, CA | 635314.0 | 1803.0 | 18079.0 | 1038.0 | 14662.0 | 869.0 | 13056.0 | 881.0 | ... | 2002.0 | 58589.0 | 1853.0 | 52524.0 | 1527.0 | 85980.0 | 2031.0 | 192333.0 | 2782.0 | MULTIPOLYGON (((-122.04412 37.20049, -122.0440... |
| 1 | 14000US06085500100 | Census Tract 5001, Santa Clara, CA | 2689.0 | 350.0 | 85.0 | 70.0 | 150.0 | 114.0 | 75.0 | 61.0 | ... | 131.0 | 183.0 | 120.0 | 324.0 | 147.0 | 376.0 | 181.0 | 728.0 | 304.0 | MULTIPOLYGON (((-121.90463 37.36078, -121.9044... |
| 2 | 14000US06085500200 | Census Tract 5002, Santa Clara, CA | 1968.0 | 252.0 | 57.0 | 58.0 | 14.0 | 21.0 | 11.0 | 20.0 | ... | 163.0 | 167.0 | 95.0 | 284.0 | 119.0 | 306.0 | 116.0 | 546.0 | 240.0 | MULTIPOLYGON (((-121.91125 37.35418, -121.9108... |
| 3 | 14000US06085500300 | Census Tract 5003, Santa Clara, CA | 1540.0 | 185.0 | 33.0 | 27.0 | 0.0 | 13.0 | 13.0 | 24.0 | ... | 153.0 | 136.0 | 64.0 | 89.0 | 49.0 | 342.0 | 104.0 | 358.0 | 98.0 | MULTIPOLYGON (((-121.92189 37.34604, -121.9218... |
| 4 | 14000US06085500400 | Census Tract 5004, Santa Clara, CA | 1049.0 | 113.0 | 27.0 | 19.0 | 5.0 | 8.0 | 40.0 | 29.0 | ... | 74.0 | 72.0 | 41.0 | 90.0 | 50.0 | 113.0 | 58.0 | 217.0 | 55.0 | MULTIPOLYGON (((-121.93034 37.33913, -121.9303... |
5 rows × 37 columns
income_geo_df.shape
(1582, 37)
# View data types and if missing values
income_geo_df.info()
<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 1582 entries, 0 to 1581 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 geoid 1582 non-null object 1 name 1582 non-null object 2 B19001001 1582 non-null float64 3 B19001001, Error 1582 non-null float64 4 B19001002 1582 non-null float64 5 B19001002, Error 1582 non-null float64 6 B19001003 1582 non-null float64 7 B19001003, Error 1582 non-null float64 8 B19001004 1582 non-null float64 9 B19001004, Error 1582 non-null float64 10 B19001005 1582 non-null float64 11 B19001005, Error 1582 non-null float64 12 B19001006 1582 non-null float64 13 B19001006, Error 1582 non-null float64 14 B19001007 1582 non-null float64 15 B19001007, Error 1582 non-null float64 16 B19001008 1582 non-null float64 17 B19001008, Error 1582 non-null float64 18 B19001009 1582 non-null float64 19 B19001009, Error 1582 non-null float64 20 B19001010 1582 non-null float64 21 B19001010, Error 1582 non-null float64 22 B19001011 1582 non-null float64 23 B19001011, Error 1582 non-null float64 24 B19001012 1582 non-null float64 25 B19001012, Error 1582 non-null float64 26 B19001013 1582 non-null float64 27 B19001013, Error 1582 non-null float64 28 B19001014 1582 non-null float64 29 B19001014, Error 1582 non-null float64 30 B19001015 1582 non-null float64 31 B19001015, Error 1582 non-null float64 32 B19001016 1582 non-null float64 33 B19001016, Error 1582 non-null float64 34 B19001017 1582 non-null float64 35 B19001017, Error 1582 non-null float64 36 geometry 1582 non-null geometry dtypes: float64(34), geometry(1), object(2) memory usage: 457.4+ KB
# Quick view of county
income_geo_df.plot()
plt.show()
# Random block group
income_geo_df.sample().plot()
plt.show()
image source:

# First 5 geoids
income_geo_df.geoid.head()
0 05000US06085 1 14000US06085500100 2 14000US06085500200 3 14000US06085500300 4 14000US06085500400 Name: geoid, dtype: object
# Delete top row of the data frame (just a summary)
income_geo_df = income_geo_df.drop([0])
income_geo_df.head()
| geoid | name | B19001001 | B19001001, Error | B19001002 | B19001002, Error | B19001003 | B19001003, Error | B19001004 | B19001004, Error | ... | B19001013, Error | B19001014 | B19001014, Error | B19001015 | B19001015, Error | B19001016 | B19001016, Error | B19001017 | B19001017, Error | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14000US06085500100 | Census Tract 5001, Santa Clara, CA | 2689.0 | 350.0 | 85.0 | 70.0 | 150.0 | 114.0 | 75.0 | 61.0 | ... | 131.0 | 183.0 | 120.0 | 324.0 | 147.0 | 376.0 | 181.0 | 728.0 | 304.0 | MULTIPOLYGON (((-121.90463 37.36078, -121.9044... |
| 2 | 14000US06085500200 | Census Tract 5002, Santa Clara, CA | 1968.0 | 252.0 | 57.0 | 58.0 | 14.0 | 21.0 | 11.0 | 20.0 | ... | 163.0 | 167.0 | 95.0 | 284.0 | 119.0 | 306.0 | 116.0 | 546.0 | 240.0 | MULTIPOLYGON (((-121.91125 37.35418, -121.9108... |
| 3 | 14000US06085500300 | Census Tract 5003, Santa Clara, CA | 1540.0 | 185.0 | 33.0 | 27.0 | 0.0 | 13.0 | 13.0 | 24.0 | ... | 153.0 | 136.0 | 64.0 | 89.0 | 49.0 | 342.0 | 104.0 | 358.0 | 98.0 | MULTIPOLYGON (((-121.92189 37.34604, -121.9218... |
| 4 | 14000US06085500400 | Census Tract 5004, Santa Clara, CA | 1049.0 | 113.0 | 27.0 | 19.0 | 5.0 | 8.0 | 40.0 | 29.0 | ... | 74.0 | 72.0 | 41.0 | 90.0 | 50.0 | 113.0 | 58.0 | 217.0 | 55.0 | MULTIPOLYGON (((-121.93034 37.33913, -121.9303... |
| 5 | 14000US06085500500 | Census Tract 5005, Santa Clara, CA | 2213.0 | 282.0 | 28.0 | 27.0 | 153.0 | 89.0 | 38.0 | 34.0 | ... | 62.0 | 319.0 | 266.0 | 217.0 | 94.0 | 281.0 | 95.0 | 646.0 | 140.0 | MULTIPOLYGON (((-121.94078 37.32835, -121.9407... |
5 rows × 37 columns
print(income_geo_df.columns)
col_to_keep = ['geoid', 'name', 'B19001001', 'B19001002',
'B19001003', 'B19001004','B19001005', 'B19001006',
'B19001007', 'B19001008','B19001009', 'B19001010',
'B19001011', 'B19001012','B19001013', 'B19001014',
'B19001015', 'B19001016','B19001017', 'geometry']
income_geo_df = income_geo_df[col_to_keep]
income_geo_df.head()
Index(['geoid', 'name', 'B19001001', 'B19001001, Error', 'B19001002',
'B19001002, Error', 'B19001003', 'B19001003, Error', 'B19001004',
'B19001004, Error', 'B19001005', 'B19001005, Error', 'B19001006',
'B19001006, Error', 'B19001007', 'B19001007, Error', 'B19001008',
'B19001008, Error', 'B19001009', 'B19001009, Error', 'B19001010',
'B19001010, Error', 'B19001011', 'B19001011, Error', 'B19001012',
'B19001012, Error', 'B19001013', 'B19001013, Error', 'B19001014',
'B19001014, Error', 'B19001015', 'B19001015, Error', 'B19001016',
'B19001016, Error', 'B19001017', 'B19001017, Error', 'geometry'],
dtype='object')
| geoid | name | B19001001 | B19001002 | B19001003 | B19001004 | B19001005 | B19001006 | B19001007 | B19001008 | B19001009 | B19001010 | B19001011 | B19001012 | B19001013 | B19001014 | B19001015 | B19001016 | B19001017 | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14000US06085500100 | Census Tract 5001, Santa Clara, CA | 2689.0 | 85.0 | 150.0 | 75.0 | 36.0 | 108.0 | 51.0 | 54.0 | 80.0 | 54.0 | 96.0 | 55.0 | 234.0 | 183.0 | 324.0 | 376.0 | 728.0 | MULTIPOLYGON (((-121.90463 37.36078, -121.9044... |
| 2 | 14000US06085500200 | Census Tract 5002, Santa Clara, CA | 1968.0 | 57.0 | 14.0 | 11.0 | 0.0 | 0.0 | 18.0 | 11.0 | 21.0 | 64.0 | 49.0 | 105.0 | 315.0 | 167.0 | 284.0 | 306.0 | 546.0 | MULTIPOLYGON (((-121.91125 37.35418, -121.9108... |
| 3 | 14000US06085500300 | Census Tract 5003, Santa Clara, CA | 1540.0 | 33.0 | 0.0 | 13.0 | 35.0 | 21.0 | 33.0 | 26.0 | 19.0 | 46.0 | 19.0 | 77.0 | 293.0 | 136.0 | 89.0 | 342.0 | 358.0 | MULTIPOLYGON (((-121.92189 37.34604, -121.9218... |
| 4 | 14000US06085500400 | Census Tract 5004, Santa Clara, CA | 1049.0 | 27.0 | 5.0 | 40.0 | 5.0 | 27.0 | 33.0 | 33.0 | 15.0 | 24.0 | 55.0 | 77.0 | 216.0 | 72.0 | 90.0 | 113.0 | 217.0 | MULTIPOLYGON (((-121.93034 37.33913, -121.9303... |
| 5 | 14000US06085500500 | Census Tract 5005, Santa Clara, CA | 2213.0 | 28.0 | 153.0 | 38.0 | 106.0 | 51.0 | 17.0 | 45.0 | 36.0 | 15.0 | 22.0 | 91.0 | 148.0 | 319.0 | 217.0 | 281.0 | 646.0 | MULTIPOLYGON (((-121.94078 37.32835, -121.9407... |
# Combine the mata json file to get variable names for the columns
import json
with open('data/metadata.json', 'r') as f:
meta_data = json.load(f)
f.close()
meta_data['tables']['B19001']['columns']
{'B19001001': {'name': 'Total:', 'indent': 0},
'B19001002': {'name': 'Less than $10,000', 'indent': 1},
'B19001003': {'name': '$10,000 to $14,999', 'indent': 1},
'B19001004': {'name': '$15,000 to $19,999', 'indent': 1},
'B19001005': {'name': '$20,000 to $24,999', 'indent': 1},
'B19001006': {'name': '$25,000 to $29,999', 'indent': 1},
'B19001007': {'name': '$30,000 to $34,999', 'indent': 1},
'B19001008': {'name': '$35,000 to $39,999', 'indent': 1},
'B19001009': {'name': '$40,000 to $44,999', 'indent': 1},
'B19001010': {'name': '$45,000 to $49,999', 'indent': 1},
'B19001011': {'name': '$50,000 to $59,999', 'indent': 1},
'B19001012': {'name': '$60,000 to $74,999', 'indent': 1},
'B19001013': {'name': '$75,000 to $99,999', 'indent': 1},
'B19001014': {'name': '$100,000 to $124,999', 'indent': 1},
'B19001015': {'name': '$125,000 to $149,999', 'indent': 1},
'B19001016': {'name': '$150,000 to $199,999', 'indent': 1},
'B19001017': {'name': '$200,000 or more', 'indent': 1}}
col_name_dict = {'B19001001': 'Total',
'B19001002': 'Less than $10,000',
'B19001003': '10,000 to $14,999',
'B19001004': '15,000 to $19,999',
'B19001005': '20,000 to $24,999',
'B19001006': '25,000 to $29,999',
'B19001007': '30,000 to $34,999',
'B19001008': '35,000 to $39,999',
'B19001009': '40,000 to $44,999',
'B19001010': '45,000 to $49,999',
'B19001011': '50,000 to $59,999',
'B19001012': '60,000 to $74,999',
'B19001013': '75,000 to $99,999',
'B19001014': '100,000 to $124,999',
'B19001015': '125,000 to $149,999',
'B19001016': '150,000 to $199,999',
'B19001017': '200,000 or more'}
income_geo_df.rename(columns = col_name_dict, inplace = True)
income_geo_df.head()
| geoid | name | Total | Less than $10,000 | 10,000 to $14,999 | 15,000 to $19,999 | 20,000 to $24,999 | 25,000 to $29,999 | 30,000 to $34,999 | 35,000 to $39,999 | 40,000 to $44,999 | 45,000 to $49,999 | 50,000 to $59,999 | 60,000 to $74,999 | 75,000 to $99,999 | 100,000 to $124,999 | 125,000 to $149,999 | 150,000 to $199,999 | 200,000 or more | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14000US06085500100 | Census Tract 5001, Santa Clara, CA | 2689.0 | 85.0 | 150.0 | 75.0 | 36.0 | 108.0 | 51.0 | 54.0 | 80.0 | 54.0 | 96.0 | 55.0 | 234.0 | 183.0 | 324.0 | 376.0 | 728.0 | MULTIPOLYGON (((-121.90463 37.36078, -121.9044... |
| 2 | 14000US06085500200 | Census Tract 5002, Santa Clara, CA | 1968.0 | 57.0 | 14.0 | 11.0 | 0.0 | 0.0 | 18.0 | 11.0 | 21.0 | 64.0 | 49.0 | 105.0 | 315.0 | 167.0 | 284.0 | 306.0 | 546.0 | MULTIPOLYGON (((-121.91125 37.35418, -121.9108... |
| 3 | 14000US06085500300 | Census Tract 5003, Santa Clara, CA | 1540.0 | 33.0 | 0.0 | 13.0 | 35.0 | 21.0 | 33.0 | 26.0 | 19.0 | 46.0 | 19.0 | 77.0 | 293.0 | 136.0 | 89.0 | 342.0 | 358.0 | MULTIPOLYGON (((-121.92189 37.34604, -121.9218... |
| 4 | 14000US06085500400 | Census Tract 5004, Santa Clara, CA | 1049.0 | 27.0 | 5.0 | 40.0 | 5.0 | 27.0 | 33.0 | 33.0 | 15.0 | 24.0 | 55.0 | 77.0 | 216.0 | 72.0 | 90.0 | 113.0 | 217.0 | MULTIPOLYGON (((-121.93034 37.33913, -121.9303... |
| 5 | 14000US06085500500 | Census Tract 5005, Santa Clara, CA | 2213.0 | 28.0 | 153.0 | 38.0 | 106.0 | 51.0 | 17.0 | 45.0 | 36.0 | 15.0 | 22.0 | 91.0 | 148.0 | 319.0 | 217.0 | 281.0 | 646.0 | MULTIPOLYGON (((-121.94078 37.32835, -121.9407... |
# Check dataset integrity
check_tract = income_geo_df.sample()
print(f"total {check_tract['Total'].values[0]} equals column sum {sum(check_tract.iloc[0][3:19].values)}")
total 531.0 equals column sum 531.0
income_geo_df.describe()
| Total | Less than $10,000 | 10,000 to $14,999 | 15,000 to $19,999 | 20,000 to $24,999 | 25,000 to $29,999 | 30,000 to $34,999 | 35,000 to $39,999 | 40,000 to $44,999 | 45,000 to $49,999 | 50,000 to $59,999 | 60,000 to $74,999 | 75,000 to $99,999 | 100,000 to $124,999 | 125,000 to $149,999 | 150,000 to $199,999 | 200,000 or more | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 | 1581.000000 |
| mean | 803.686275 | 22.870335 | 18.547755 | 16.516129 | 15.776091 | 15.537002 | 16.750158 | 15.643264 | 16.651486 | 15.745731 | 31.500316 | 50.013915 | 75.501581 | 74.116382 | 66.444023 | 108.766603 | 243.305503 |
| std | 567.312353 | 32.397125 | 30.948565 | 27.424959 | 23.879268 | 23.470546 | 24.812465 | 23.327075 | 24.496486 | 23.023045 | 35.760961 | 51.987277 | 70.487773 | 67.151532 | 66.675980 | 99.144036 | 231.221927 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 416.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 14.000000 | 25.000000 | 26.000000 | 21.000000 | 42.000000 | 88.000000 |
| 50% | 603.000000 | 11.000000 | 7.000000 | 6.000000 | 8.000000 | 7.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 20.000000 | 35.000000 | 55.000000 | 55.000000 | 47.000000 | 78.000000 | 179.000000 |
| 75% | 999.000000 | 33.000000 | 24.000000 | 22.000000 | 22.000000 | 21.000000 | 23.000000 | 23.000000 | 23.000000 | 22.000000 | 47.000000 | 70.000000 | 106.000000 | 103.000000 | 92.000000 | 143.000000 | 312.000000 |
| max | 5556.000000 | 405.000000 | 254.000000 | 282.000000 | 220.000000 | 216.000000 | 217.000000 | 314.000000 | 169.000000 | 224.000000 | 275.000000 | 371.000000 | 535.000000 | 461.000000 | 720.000000 | 1207.000000 | 2174.000000 |
income_geo_df['Total'].hist(bins = 100)
plt.show()
sns.kdeplot(income_geo_df['Total'], shade = True, color = 'purple')
plt.show()
print(f"There are {len(income_geo_df[income_geo_df['Total'] == 0])} census blocks without households in them")
There are 2 census blocks without households in them
We want to check the distribution of all the columns of interest and can do so with a Ridgeplot as well as box plot.
# want a Ridgeplot of all distributions
# melt DF to prep for visualization
cols = income_geo_df.describe().columns.values
melted_df = pd.melt(income_geo_df[cols[1:]])
melted_df
| variable | value | |
|---|---|---|
| 0 | Less than $10,000 | 85.0 |
| 1 | Less than $10,000 | 57.0 |
| 2 | Less than $10,000 | 33.0 |
| 3 | Less than $10,000 | 27.0 |
| 4 | Less than $10,000 | 28.0 |
| ... | ... | ... |
| 25291 | 200,000 or more | 0.0 |
| 25292 | 200,000 or more | 0.0 |
| 25293 | 200,000 or more | 47.0 |
| 25294 | 200,000 or more | 14.0 |
| 25295 | 200,000 or more | 63.0 |
25296 rows × 2 columns
We see that the distributions for "Less than 10,000" to "45,000 to \$49,999" are fairly similar and in general there are quite a lot of outliers, especially for "200,000 or more". Making a ridgeplot from '50,000 to $59,999' to '200,000 or more' might be interesting to look at.
plt.figure(figsize = (15, 10))
sns.boxplot(data = income_geo_df[cols[1:]])
plt.xticks(rotation = 90)
plt.show()
import numpy as np
sns.set_theme(style = "white")
plt.figure(figsize = (15, 10))
sns.kdeplot(data = income_geo_df[cols[1:]], fill = True, alpha = .05, legend = True)
plt.xticks(rotation = 90)
plt.xscale('log')
plt.xlabel("Log scale households across block areas")
plt.ylabel("Density of Counts")
plt.title("Counts of number of households within an income bracket across blocks in Santa Clara County")
plt.show()
sns.kdeplot(income_geo_df['200,000 or more'], shade = True, color = 'orange')
plt.show()
sns.kdeplot(income_geo_df['Less than $10,000'], shade = True, color = 'green')
plt.show()
household_income_norm = income_geo_df.copy()
for col in cols[1:]:
household_income_norm[col] = (income_geo_df[col]/income_geo_df['Total']) *100
household_income_norm.head()
| geoid | name | Total | Less than $10,000 | 10,000 to $14,999 | 15,000 to $19,999 | 20,000 to $24,999 | 25,000 to $29,999 | 30,000 to $34,999 | 35,000 to $39,999 | 40,000 to $44,999 | 45,000 to $49,999 | 50,000 to $59,999 | 60,000 to $74,999 | 75,000 to $99,999 | 100,000 to $124,999 | 125,000 to $149,999 | 150,000 to $199,999 | 200,000 or more | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14000US06085500100 | Census Tract 5001, Santa Clara, CA | 2689.0 | 3.161026 | 5.578282 | 2.789141 | 1.338788 | 4.016363 | 1.896616 | 2.008181 | 2.975084 | 2.008181 | 3.570100 | 2.045370 | 8.702120 | 6.805504 | 12.049089 | 13.982893 | 27.073261 | MULTIPOLYGON (((-121.90463 37.36078, -121.9044... |
| 2 | 14000US06085500200 | Census Tract 5002, Santa Clara, CA | 1968.0 | 2.896341 | 0.711382 | 0.558943 | 0.000000 | 0.000000 | 0.914634 | 0.558943 | 1.067073 | 3.252033 | 2.489837 | 5.335366 | 16.006098 | 8.485772 | 14.430894 | 15.548780 | 27.743902 | MULTIPOLYGON (((-121.91125 37.35418, -121.9108... |
| 3 | 14000US06085500300 | Census Tract 5003, Santa Clara, CA | 1540.0 | 2.142857 | 0.000000 | 0.844156 | 2.272727 | 1.363636 | 2.142857 | 1.688312 | 1.233766 | 2.987013 | 1.233766 | 5.000000 | 19.025974 | 8.831169 | 5.779221 | 22.207792 | 23.246753 | MULTIPOLYGON (((-121.92189 37.34604, -121.9218... |
| 4 | 14000US06085500400 | Census Tract 5004, Santa Clara, CA | 1049.0 | 2.573880 | 0.476644 | 3.813155 | 0.476644 | 2.573880 | 3.145853 | 3.145853 | 1.429933 | 2.287893 | 5.243089 | 7.340324 | 20.591039 | 6.863680 | 8.579600 | 10.772164 | 20.686368 | MULTIPOLYGON (((-121.93034 37.33913, -121.9303... |
| 5 | 14000US06085500500 | Census Tract 5005, Santa Clara, CA | 2213.0 | 1.265251 | 6.913692 | 1.717126 | 4.789878 | 2.304564 | 0.768188 | 2.033439 | 1.626751 | 0.677813 | 0.994126 | 4.112065 | 6.687754 | 14.414822 | 9.805694 | 12.697695 | 29.191143 | MULTIPOLYGON (((-121.94078 37.32835, -121.9407... |
break up the incomes per boxplot distribution spead to get a better look at density
# Get higher income distribution df made
melted_df_higher = pd.melt(household_income_norm[cols[-7:]])
# Get Lower income distribtion df made
melted_df_lower = pd.melt(household_income_norm[cols[1:10]])
def ridgeplot(df, title, title_ht):
"""
df : dataframe, the data to use
title : string, the title of the plot
return : a ridgeplot
"""
sns.set_theme(style = "white", rc = {"axes.facecolor": (0,0,0,0)}) # transparent background
palette = sns.color_palette("Set2", 12)
g = sns.FacetGrid(df,
palette = palette,
hue = 'variable',
row = 'variable',
aspect = 7.5,
height = 1.8)
g.map_dataframe(sns.kdeplot, x = 'value', fill = True, alpha = .5)
g.map_dataframe(sns.kdeplot, x = 'value', color = 'black')
# Create Labels
def create_label(x, color, label):
ax = plt.gca()
ax.text(1, .02, label, color = 'black', fontsize = 13,
ha = 'right', transform = ax.transAxes)
g.map(create_label, 'variable')
g.fig.subplots_adjust(hspace = -.6)
g.set_titles("")
g.set(yticks =[], xlabel = 'household counts across geographic blocks')
g.despine(left = True)
plt.suptitle(title, y = title_ht, x = .5)
plt.show()
ridgeplot(melted_df_higher,
"Upper Income Household Counts across block groups for Santa Clara County, CA",
.98)
ridgeplot(melted_df_lower,
"Lower Income Household Counts across block groups for Santa Clara County, CA",
.93)
Packages
Choose a column to visualize, though in a dashboard, we can choose a different income column to visualize using a dropdown menu.
sns.set_theme(style = 'white')
household_income_norm.plot(figsize=(12, 10),
column = "Less than $10,000",
legend = True,
scheme = "NaturalBreaks")
plt.show()
below we can see that there is a small concentrated population of households that live in the center of the county and then in the large county out to the right. This area to the right is rural land, and we tend to think everything is spread out in rural country so access to farmer's market will require transportation likely by personal vehicle.
sns.set_theme(style = 'white')
fig, ax = plt.subplots(2, 1, figsize = (20, 20))
# Subplots
ax1, ax2 = ax
# Low Income households
household_income_norm.plot(column = "Less than $10,000",
cmap = "gist_earth_r",
k = 5,
edgecolor = 'white',
linewidth = 0,
alpha = 0.5,
ax = ax1,
legend = True,
scheme = "quantiles")
ax1.axis('off')
ax1.set_title("Income: Less than $10,000")
# Higher Income households
household_income_norm.plot(column = "200,000 or more",
cmap = "gist_earth_r",
k = 5,
edgecolor = 'white',
linewidth = 0,
alpha = 0.5,
ax = ax2,
legend = True,
scheme = "quantiles")
ax2.axis('off')
ax2.set_title("Income: 200,000 or more")
plt.show()
We are going to create new features of the data by combining column data followed by normalizing. Based on California data
so we will combine:
income_geo_df['Less than $20,000'] = income_geo_df[cols[1:4]].sum(axis = 1)
income_geo_df['Between 20,000 and $59,999'] = income_geo_df[cols[4:11]].sum(axis = 1)
income_geo_df['Between 60,000 and $124,999'] = income_geo_df[cols[11:14]].sum(axis = 1)
income_geo_df['Greater than $125,000'] = income_geo_df[cols[14:]].sum(axis = 1)
new_feat = ['Less than $20,000','Between 20,000 and $59,999',
'Between 60,000 and $124,999','Greater than $125,000']
household_income_norm2 = income_geo_df.copy()
for col in new_feat:
household_income_norm2[col] = (income_geo_df[col]/income_geo_df['Total']) *100
household_income_norm2.head()
| geoid | name | Total | Less than $10,000 | 10,000 to $14,999 | 15,000 to $19,999 | 20,000 to $24,999 | 25,000 to $29,999 | 30,000 to $34,999 | 35,000 to $39,999 | ... | 75,000 to $99,999 | 100,000 to $124,999 | 125,000 to $149,999 | 150,000 to $199,999 | 200,000 or more | geometry | Less than $20,000 | Between 20,000 and $59,999 | Between 60,000 and $124,999 | Greater than $125,000 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14000US06085500100 | Census Tract 5001, Santa Clara, CA | 2689.0 | 85.0 | 150.0 | 75.0 | 36.0 | 108.0 | 51.0 | 54.0 | ... | 234.0 | 183.0 | 324.0 | 376.0 | 728.0 | MULTIPOLYGON (((-121.90463 37.36078, -121.9044... | 11.528449 | 17.813313 | 17.552994 | 53.105244 |
| 2 | 14000US06085500200 | Census Tract 5002, Santa Clara, CA | 1968.0 | 57.0 | 14.0 | 11.0 | 0.0 | 0.0 | 18.0 | 11.0 | ... | 315.0 | 167.0 | 284.0 | 306.0 | 546.0 | MULTIPOLYGON (((-121.91125 37.35418, -121.9108... | 4.166667 | 8.282520 | 29.827236 | 57.723577 |
| 3 | 14000US06085500300 | Census Tract 5003, Santa Clara, CA | 1540.0 | 33.0 | 0.0 | 13.0 | 35.0 | 21.0 | 33.0 | 26.0 | ... | 293.0 | 136.0 | 89.0 | 342.0 | 358.0 | MULTIPOLYGON (((-121.92189 37.34604, -121.9218... | 2.987013 | 12.922078 | 32.857143 | 51.233766 |
| 4 | 14000US06085500400 | Census Tract 5004, Santa Clara, CA | 1049.0 | 27.0 | 5.0 | 40.0 | 5.0 | 27.0 | 33.0 | 33.0 | ... | 216.0 | 72.0 | 90.0 | 113.0 | 217.0 | MULTIPOLYGON (((-121.93034 37.33913, -121.9303... | 6.863680 | 18.303146 | 34.795043 | 40.038132 |
| 5 | 14000US06085500500 | Census Tract 5005, Santa Clara, CA | 2213.0 | 28.0 | 153.0 | 38.0 | 106.0 | 51.0 | 17.0 | 45.0 | ... | 148.0 | 319.0 | 217.0 | 281.0 | 646.0 | MULTIPOLYGON (((-121.94078 37.32835, -121.9407... | 9.896069 | 13.194758 | 25.214641 | 51.694532 |
5 rows × 24 columns
# new Kdeplots
kdeplot_df = household_income_norm2[new_feat]
kdeplot_df_melt = pd.melt(kdeplot_df)
ridgeplot(kdeplot_df_melt,
"Household Counts across block groups for Santa Clara County, CA",
.93)
sns.set_theme(style = 'white')
fig, ax = plt.subplots(2, 1, figsize = (20, 20))
# Subplots
ax1, ax2 = ax
# Low Income households
household_income_norm2.plot(column = 'Less than $20,000',
cmap = "gist_earth_r",
k = 5,
edgecolor = 'white',
linewidth = 0,
alpha = 0.5,
ax = ax1,
legend = True,
scheme = "quantiles")
ax1.axis('off')
ax1.set_title("Income: Less than $10,000")
# Higher Income households
household_income_norm2.plot(column = 'Greater than $125,000',
cmap = "gist_earth_r",
k = 5,
edgecolor = 'white',
linewidth = 0,
alpha = 0.5,
ax = ax2,
legend = True,
scheme = "quantiles")
ax2.axis('off')
ax2.set_title("Income: 125,000 or more")
plt.show()
Let's get more explicit with thresholding the data
threshold = 30
col_name = 'Between 20,000 and $59,999'
norm_web_merc = household_income_norm2.to_crs(epsg = 3857)
fig, ax = plt.subplots(figsize = (20, 20))
norm_web_merc[norm_web_merc[col_name] > threshold].plot(ax = ax, alpha = 0.8)
ax.axis('off')
ax.set_title(f"Census Blocks with more than {threshold}% of the households are {col_name}")
ctx.add_basemap(ax)